A CUDA Kernel Scheduler Exploiting Static Data Dependencies
Abstract
The CUDA execution model of Nvidia's GPUs is based on the asynchronous execution of thread blocks, where each thread executes the same kernel in a data-parallel fashion. When threads in different thread blocks need to synchronise and communicate, the whole computation launched onto the GPU needs to be stopped and re-invoked in order to facilitate inter-block synchronisation and communication. The need for synchronisation is tightly connected with the underlying data dependency pattern of the computation. For a good range of algorithms, the underlying data dependency pattern is static, scalable and shows some regularity. Sorting networks, the Fast Fourier Transform, and stencil computations of PDE solvers are such examples, but parallel design patterns like scan, reduce, and the like can also be considered. In such cases, much of the effort of devising and scheduling CUDA kernels for the computation can be automated by exposing the dataflow representation of the computation in the program code using a
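To make the idea of a static data dependency pattern concrete, the following is a minimal Python sketch, not the scheduler described in the abstract. It models a bitonic sorting network, one of the sorting networks the abstract alludes to, as a sequence of stages. Within a stage all compare-exchange pairs touch disjoint elements, so they could run in independent thread blocks; between stages every element may depend on results from other blocks, which in the model described above corresponds to ending one kernel and launching the next. The function names `bitonic_stages`, `run_stage`, and `bitonic_sort` are hypothetical and chosen only for this illustration.

```python
def bitonic_stages(n):
    """Yield the stages of a bitonic sorting network on n = 2**m inputs.

    Each stage is a list of (i, partner, ascending) compare-exchange pairs.
    The pattern depends only on n, never on the data: it is static.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner, ascending))
            yield stage
            j //= 2
        k *= 2


def run_stage(data, stage):
    """Stand-in for one kernel launch (assumption: one launch per stage).

    Pairs within a stage are disjoint, so their order does not matter.
    """
    for i, partner, ascending in stage:
        if (data[i] > data[partner]) == ascending:
            data[i], data[partner] = data[partner], data[i]


def bitonic_sort(values):
    """Sort by replaying the stages; returning from run_stage plays the
    role of the global synchronisation point between launches."""
    data = list(values)
    launches = 0
    for stage in bitonic_stages(len(data)):
        run_stage(data, stage)
        launches += 1
    return data, launches
```

For 8 inputs the network has 6 stages, so this sketch would issue 6 simulated launches. A real scheduler could presumably fuse consecutive stages whose dependencies stay inside one thread block, but that is an assumption about the design space rather than something the abstract states.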
Similar resources
Speckle Reduction in Synthetic Aperture Radar Images in Wavelet Domain Exploiting Intra-scale and Inter-scale Dependencies
Synthetic Aperture Radar (SAR) images are inherently affected by a multiplicative noise-like phenomenon called speckle, which is indeed the nature of all coherent systems. Speckle decreases the performance of almost all the information extraction methods such as classification, segmentation, and change detection, therefore speckle must be suppressed. Despeckling can be applied by the multilooki...
A Multi-Stage CUDA Kernel for Floyd-Warshall
We present a new implementation of the Floyd-Warshall All-Pairs Shortest Paths algorithm on CUDA. Our algorithm runs approximately 5 times faster than the previously best reported algorithm. In order to achieve this speedup, we applied a new technique to reduce usage of on-chip shared memory and allow the CUDA scheduler to more effectively hide instruction latency.
Avoiding Blocking System Calls in a User-Level Threads Scheduler for Shared Memory Multiprocessors
SMP machines are frequently used to perform heavily parallel computations. The concepts of multithreading have proved suitable for exploiting SMP architectures. In general, application developers use a thread library to write such a program. This library schedules threads itself or relies on the operating system kernel to do so. However, both of these approaches pose a number of problems. This ...
Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators
Hardware accelerators are becoming ubiquitous in high-performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized n...
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
Hardware accelerators are becoming ubiquitous in high-performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming languages (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical k...